hyperparameter tuning
application to case study
interpretability evaluation
Görtler, J., Kehlbeck, R., & Deussen, O. (2019). A visual exploration of Gaussian processes. Distill, 4(4). doi:10.23915/distill.00017
Deisenroth, M., Luo, Y., & van der Wilk, M. (2020). A practical guide to Gaussian processes. https://infallible-thompson-49de36.netlify.app/.
Q: How do we choose the kernel hyperparameters \(\theta = \left(\ell, \sigma_f, \dots\right)\)?
A: Find \(\theta\) that makes the observed data \(\mathbf{y}\) most probable:
Marginal likelihood
\[p(\mathbf{y} | \mathbf{X}, \theta) = \int p(\mathbf{y} | \mathbf{f}, \mathbf{X}) p(\mathbf{f} | \mathbf{X}, \theta) d\mathbf{f}\]
This “marginalizes” over all possible functions \(\mathbf{f}\) with hyperparameter \(\theta\).
For the GP model, this has a closed form:
\[\log p(\mathbf{y} | \mathbf{X}, \theta) = -\frac{1}{2}\mathbf{y}^\top(\mathbf{K} + \sigma_n^2\mathbf{I})^{-1}\mathbf{y} - \frac{1}{2}\log|\mathbf{K} + \sigma_n^2\mathbf{I}| + \text{const}\]
This has two competing terms – data fit vs. model complexity.
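The closed form above can be evaluated directly. Below is a minimal numpy sketch (function names and the Cholesky-based evaluation are my own choices, not from the slides) for a zero-mean GP with an RBF kernel; the three terms in the return line correspond to the data-fit term, the complexity penalty, and the constant.

```python
import numpy as np

def rbf_kernel(X1, X2, ell=1.0, sigma_f=1.0):
    """RBF kernel: k(x, x') = sigma_f^2 * exp(-(x - x')^2 / (2 ell^2))."""
    sq = (X1[:, None] - X2[None, :]) ** 2
    return sigma_f**2 * np.exp(-sq / (2 * ell**2))

def log_marginal_likelihood(X, y, ell, sigma_f, sigma_n):
    """log p(y | X, theta) for a zero-mean GP, via a Cholesky factorization."""
    n = len(X)
    K = rbf_kernel(X, X, ell, sigma_f) + sigma_n**2 * np.eye(n)
    L = np.linalg.cholesky(K)                      # K + sigma_n^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K + sigma_n^2 I)^{-1} y
    data_fit = -0.5 * y @ alpha                    # -1/2 y^T (K + sigma_n^2 I)^{-1} y
    complexity = -np.sum(np.log(np.diag(L)))       # -1/2 log|K + sigma_n^2 I|
    const = -0.5 * n * np.log(2 * np.pi)
    return data_fit + complexity + const
```

The Cholesky route is the standard numerically stable way to get both the solve and the log-determinant from one factorization.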
Consider fitting a sine wave with noise. How would you choose the RBF \(\ell\)?
Very small \(\ell\) (overfitting)
Very large \(\ell\) (underfitting)
The right choice of \(\ell\) will be flexible enough to fit the sine curve but not the noise. The marginal likelihood finds this automatically without any cross-validation.
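A minimal sketch of this idea (the sine data, noise level, and lengthscale grid are my own illustrative choices): generate noisy sine data, evaluate the log marginal likelihood over a grid of lengthscales, and pick the maximizer. A moderate \(\ell\) should beat both a tiny one (which chases the noise) and a huge one (which flattens the sine).

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 30)
y = np.sin(X) + 0.1 * rng.standard_normal(30)    # sine wave plus noise

def lml(ell, sigma_f=1.0, sigma_n=0.1):
    """Log marginal likelihood of a zero-mean RBF GP with lengthscale ell."""
    K = sigma_f**2 * np.exp(-(X[:, None] - X[None, :])**2 / (2 * ell**2))
    K += sigma_n**2 * np.eye(len(X))
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * y @ np.linalg.solve(K, y) - 0.5 * logdet \
           - 0.5 * len(X) * np.log(2 * np.pi)

ells = np.logspace(-2, 2, 50)                    # candidate lengthscales
best = ells[np.argmax([lml(l) for l in ells])]   # marginal-likelihood choice
```

No held-out data or cross-validation is involved: the complexity penalty inside the marginal likelihood does the regularization by itself.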
Observation = Signal + Noise
\[\begin{align*} y = f(x) + \epsilon \end{align*}\]
Variance Decomposition
\[\begin{align*} \underbrace{\operatorname{Var}[y]}_{\text {what we see }}=\underbrace{\sigma_f^2}_{\text {signal }}+\underbrace{\sigma_n^2}_{\text {noise }} \end{align*}\]
Heuristic 1: If the instrument’s precision is known (e.g., Kepler telescope noise \(\sigma_n \approx 10^{-4}\)), then set \[\begin{align*} \sigma_f^2=\operatorname{Var}[y]-\sigma_n^2 \end{align*}\]
Heuristic 2: Suppose the signal dominates by some assumed SNR factor \(\kappa\) (e.g., 2 to 100). Then set \[\begin{align*} \sigma_{f}^{2} &\approx \operatorname{Var}[y] \\ \sigma_{n}^{2} &\approx \frac{\sigma_f^2}{\kappa^2} \end{align*}\]
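Both heuristics amount to a couple of lines of numpy. This sketch uses synthetic noisy-sine data and illustrative values for \(\sigma_n\) and \(\kappa\) (all of which are my own choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
y = np.sin(np.linspace(0, 10, 200)) + 0.05 * rng.standard_normal(200)
var_y = np.var(y)                    # Var[y] = sigma_f^2 + sigma_n^2

# Heuristic 1: instrument noise level is known.
sigma_n = 0.05                       # assumed known precision
sigma_f2 = var_y - sigma_n**2        # signal variance by subtraction

# Heuristic 2: assume a signal-to-noise ratio kappa.
kappa = 10                           # assumed SNR factor
sigma_f2_h2 = var_y                  # signal dominates, so sigma_f^2 ~ Var[y]
sigma_n2_h2 = sigma_f2_h2 / kappa**2
```

Either way, the point is to initialize \(\sigma_f^2\) and \(\sigma_n^2\) at plausible scales before any marginal-likelihood optimization.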
Lengthscale controls how far we travel before \(f(x)\) and \(f(x')\) decorrelate
Heuristic: Set lengthscale relative to spread of input data \[\ell \approx \lambda \cdot \text{SD}[\mathbf{X}], \quad \lambda \in [0.2, 10]\]
Alternative: Use median distance from mean \[\ell \approx \text{median}\{|x_i - \bar{x}|\}\]
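Both lengthscale heuristics are one-liners in numpy. A minimal sketch with hypothetical 1-D inputs (the grid and the multiplier \(\lambda\) are illustrative choices):

```python
import numpy as np

X = np.linspace(0, 10, 101)          # hypothetical 1-D inputs

# Heuristic: lengthscale proportional to the spread of the inputs.
lam = 1.0                            # assumed multiplier, lambda in [0.2, 10]
ell_sd = lam * np.std(X)

# Alternative: median absolute distance from the mean.
ell_med = np.median(np.abs(X - X.mean()))
```

These only set the scale of \(\ell\); a marginal-likelihood optimizer would refine it from there.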
Compiled: https://go.wisc.edu/yiu322
Respond to [GP True/False] parts (a)–(c) in the exercise sheet.
Select all the TRUE statements about GPs below.
TRUE / FALSE: The posterior variance at a test point \(x^*\) is always strictly smaller than the prior variance at that point, regardless of where the training data are located.
TRUE / FALSE: Multiplying two kernels \(k^{\text{new}}(x, x') = k_{1}(x, x')k_{2}(x, x')\) always results in a valid positive semi-definite kernel.
TRUE / FALSE: The posterior variance at a test point \(x^*\) is always less than or equal to the prior variance at that point, regardless of the distances to the training data.
The kernel is a hypothesis about the data generating process.
What we learned
The GP is not just a curve-fitting routine; it helps discover meaningful astrophysics.
GPs do well on all five criteria; deep learning fails criteria 2 through 5.
Linear regression
Gaussian processes
Both GPs and linear regression are interpretable, but in different senses.
Limitations:
Extensions: